A Survey of FPGA Optimization Methods for Data Center Energy Efficiency
This article provides a survey of academic literature about field
programmable gate array (FPGA) and their utilization for energy efficiency
acceleration in data centers. The goal is to critically present the existing
FPGA energy optimization techniques and discuss how they can be applied to such
systems. To do so, the article explores current energy trends and their
projection to the future with particular attention to the requirements set out
by the European Code of Conduct for Data Center Energy Efficiency. The article
then proposes a complete analysis of over ten years of research in energy
optimization techniques, classifying them by purpose, method of application,
and impacts on the sources of consumption. Finally, we conclude with the
challenges and possible innovations we expect for this sector.
Comment: Accepted for publication in IEEE Transactions on Sustainable Computing.
Platform-Aware FPGA System Architecture Generation based on MLIR
FPGA acceleration is becoming increasingly important to meet the performance
demands of modern computing, particularly in big data or machine learning
applications. As such, significant effort is being put into the optimization of
the hardware accelerators. However, integrating accelerators into modern FPGA
platforms, with key features such as high bandwidth memory (HBM), requires
manual effort from a platform expert for every new application. We propose the
Olympus multi-level intermediate representation (MLIR) dialect and Olympus-opt,
a series of analysis and transformation passes on this dialect, for
representing and optimizing platform-aware, system-level FPGA architectures. By
leveraging MLIR, our automation will be extensible and reusable across many
input sources and many platform-specific back-ends.
Comment: Accepted for presentation at the CPS workshop 2023
(http://www.cpsschool.eu/cps-workshop).
Enabling Automated Bug Detection for IP-based Designs using High-Level Synthesis
Modern System-on-Chip (SoC) architectures are increasingly composed of Intellectual Property (IP) blocks, usually designed and provided by different vendors. This burdens system designers with complex system-level integration and verification. In this paper, we propose an approach that leverages high-level synthesis (HLS) techniques to automatically find bugs in designs composed of multiple IP blocks. Our method is particularly suitable for industrial adoption because it works without exposing sensitive information (e.g., the design specification or the component generation process). This advocates the definition and adoption of an interoperable format for cross-vendor hardware bug detection.
Performance Estimation of Task Graphs Based on Path Profiling
Correctly estimating the speed-up of a parallel embedded application is crucial to efficiently compare different parallelization techniques, task graph transformations, or mapping and scheduling solutions. Unfortunately, especially in the case of control-dominated applications, task correlations may heavily affect the execution time of the solutions, and this is usually not properly taken into account during performance analysis. We propose a methodology that combines a single profiling of the initial sequential specification with different decisions in terms of partitioning, mapping, and scheduling in order to better estimate the actual speed-up of these solutions. We validated our approach on a multi-processor simulation platform: experimental results show that our methodology, by effectively identifying the correlations among tasks, significantly outperforms existing approaches for speed-up estimation. Indeed, we obtained an average absolute error below 5%, even when compiling the code with different optimization levels.
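To make the baseline concrete, the following minimal sketch estimates speed-up as sequential time over critical-path time of a task graph. The task names, times, and dependencies are invented illustration data, and the sketch deliberately ignores the task correlations that the paper's methodology models, which is exactly why such a naive estimate can be optimistic.

```python
# Illustrative task graph: execution times and predecessor lists (made up).
times = {"a": 4, "b": 3, "c": 5, "d": 2}
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}

def finish(task, memo={}):
    """Earliest finish time of a task, assuming unlimited processors."""
    if task not in memo:
        memo[task] = times[task] + max((finish(p) for p in deps[task]), default=0)
    return memo[task]

seq = sum(times.values())                # sequential execution time
crit = max(finish(t) for t in times)     # critical-path (parallel) time
speedup = seq / crit                     # naive, correlation-free estimate
```

A correlation-aware estimator, as in the paper, would additionally weight paths by profiled branch behavior instead of assuming worst-case times for every task.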
Iris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization
Optimizing data movements is becoming one of the biggest challenges in
heterogeneous computing to cope with data deluge and, consequently, big data
applications. When creating specialized accelerators, modern high-level
synthesis (HLS) tools are increasingly efficient in optimizing the
computational aspects, but data transfers have not been adequately improved. To
combat this, novel architectures such as High-Bandwidth Memory with wider data
busses have been developed so that more data can be transferred in parallel.
Designers must tailor their hardware/software interfaces to fully exploit the
available bandwidth. HLS tools can automate this process, but the designer must
follow strict coding-style rules. If the bus width is not evenly divisible by
the data width (e.g., when using custom-precision data types) or if the arrays
are not power-of-two length, the HLS-generated accelerator will likely not
fully utilize the available bandwidth, demanding even more manual effort from
the designer. We propose a methodology to automatically find and implement a
data layout that, when streamed between memory and an accelerator, uses a
higher percentage of the available bandwidth than a naive or HLS-optimized
design. We borrow concepts from multiprocessor scheduling to achieve such high
efficiency.
Comment: Accepted for presentation at ASPDAC'2
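The bandwidth problem the abstract describes can be quantified with a short back-of-the-envelope sketch. The bus and data widths below are illustrative assumptions (a 512-bit memory bus and a 20-bit custom-precision type), not values from the paper: padding each element to the next power-of-two slot wastes bandwidth, while packing elements back-to-back recovers most of it.

```python
def utilization(bus_bits, data_bits, slot_bits=None):
    """Fraction of each bus beat that carries useful data.

    slot_bits is the per-element storage slot; defaults to dense packing.
    """
    slot = slot_bits or data_bits
    elems_per_beat = bus_bits // slot
    return elems_per_beat * data_bits / bus_bits

# Assumed example: 512-bit bus, 20-bit custom-precision elements.
naive = utilization(512, 20, slot_bits=32)  # pad each element to 32 bits
packed = utilization(512, 20)               # pack elements back-to-back
```

Here the padded layout moves 16 elements per beat (62.5% utilization) while dense packing moves 25 (about 97.7%), which is the kind of gap an automated data-layout generator aims to close.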
Bridging the Gap between Software and Hardware Designers Using High-Level Synthesis
Modern Systems-on-Chip (SoC) architectures and CPU+FPGA computing platforms are moving towards heterogeneous systems featuring an increasing number of hardware accelerators. These specialized components can deliver energy-efficient high performance, but their design from high-level specifications is usually very complex. Therefore, it is crucial to understand how to design and optimize such components to implement the desired functionality. This paper discusses the challenges that separate software programmers from hardware designers, focusing on state-of-the-art methods based on high-level synthesis (HLS). It also highlights future research lines for simplifying the creation of complex accelerator-based architectures.
Dataflow Computing with Polymorphic Registers
Heterogeneous systems are becoming increasingly popular for data processing. They improve the performance of simple kernels applied to large amounts of data. However, sequential data loads may have a negative impact. Data-parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. Furthermore, through PRF customization, specific data path features are exposed to the programmer in a very convenient way. PRFs allow additional control over the register dimensions and the number of elements that can be simultaneously accessed by computational units. This paper shows how PRFs can be integrated into dataflow computational platforms. In particular, starting from an annotated source code, we present a compiler-based methodology that automatically generates the customized PRFs and the enhanced computational kernels that efficiently exploit them.
The Case for Polymorphic Registers in Dataflow Computing
Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data-parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve throughput by up to 56.17× and show that the PRF-augmented system outperforms the GPU for 9×9 or larger mask sizes, even in bandwidth-constrained systems.
ASSURE: RTL Locking Against an Untrusted Foundry
Semiconductor design companies are integrating proprietary intellectual
property (IP) blocks to build custom integrated circuits (IC) and fabricate
them in a third-party foundry. Unauthorized IC copies cost these companies
billions of dollars annually. While several methods have been proposed for
hardware IP obfuscation, they operate on the gate-level netlist, i.e., after
the synthesis tools embed the semantic information into the netlist. We propose
ASSURE to protect hardware IP modules operating on the register-transfer level
(RTL) description. The RTL approach has three advantages: (i) it allows
designers to obfuscate IP cores generated with many different methods (e.g.,
hardware generators, high-level synthesis tools, and pre-existing IPs); (ii) it
obfuscates the semantics of an IC before logic synthesis; (iii) it does not
require modifications to EDA flows. We perform a cost and security assessment
of ASSURE.
Comment: Submitted to IEEE Transactions on VLSI Systems on 11-Oct-2020,
28-Jan-202
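One common ingredient of RTL-level locking is key-based constant obfuscation: a hard-coded constant is replaced by its XOR with a secret key, so the netlist handed to the foundry never contains the original value. The sketch below illustrates only that general idea with invented values; it is not ASSURE's actual transformation or key-management scheme.

```python
# Illustrative values only; in practice the key lives in tamper-proof memory.
SECRET_CONST = 0xC0FFEE   # the constant the designer wants to hide
KEY = 0x1234AB            # the locking key, withheld from the foundry
locked = SECRET_CONST ^ KEY  # what the untrusted foundry actually sees

def unlock(stored, key):
    """Recover the original constant; a wrong key yields a wrong value."""
    return stored ^ key

right = unlock(locked, KEY)       # correct key restores SECRET_CONST
wrong = unlock(locked, 0xDEAD01)  # wrong key produces garbage
```

The security of such a scheme rests on the key never appearing in the fabricated design, which is why the paper pairs locking with a cost and security assessment.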
Optimizing the Use of Behavioral Locking for High-Level Synthesis
The globalization of the electronics supply chain requires effective methods
to thwart reverse engineering and IP theft. Logic locking is a promising
solution, but there are many open concerns. First, even when applied at a
higher level of abstraction, locking may result in significant overhead without
improving the security metric. Second, optimizing a security metric is
application-dependent and designers must evaluate and compare alternative
solutions. We propose a meta-framework to optimize the use of behavioral
locking during the high-level synthesis (HLS) of IP cores. Our method operates
on the chip's specification (before HLS) and is compatible with all HLS tools,
complementing industrial EDA flows. Our meta-framework supports different
strategies to explore the design space and to select points to be locked
automatically. We evaluated our method on the optimization of differential
entropy, achieving better results than random or topological locking: 1) we
always identify a valid solution that optimizes the security metric, while
topological and random locking can generate unfeasible solutions; 2) we
reduce the number of bits used for locking by up to 90% or more (requiring
smaller tamper-proof memories); 3) we make better use of hardware resources
since we obtain similar overheads but with a higher security metric.
Comment: Accepted for publication in IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems.